ABSTRACT

This research investigated the use of the Rasch 
simple logistic model in item and test calibration. Tests employing 
word, picture, symbol, and number analogies were administered to 
college students, high school students, civil service clerical 
employees, and clients of the Minnesota Division of Vocational 
Rehabilitation. The results suggest that Rasch item easiness 
estimates are invariant with respect to the ability of the 
calibrating sample when an adequate sample is employed. The 
invariance of the Rasch item easiness estimates was shown to be 
related to the goodness-of-fit of the items to the Rasch model. The
deletion of items with low Rasch probabilities increased the 
invariance of the Rasch item easiness estimates. Estimates of the 
amount of ability indicated by the raw scores on a test (ability 
estimates) were also shown to be invariant with respect to the 
ability of the calibrating sample for tests of 25 or more items, even 
when relatively small samples were employed. (For related document, 
see TM 002 270.) (Author) 
An Investigation of the Rasch Simple Logistic Model:

Sample-Free Item and Test Calibration 

Howard E. A. Tinsley and René V. Dawis
University of Minnesota 

Gulliksen (1950) remarked over twenty years ago that the discovery of 
item parameters which would remain stable as the item analysis group changed 
would constitute a significant contribution to item analysis theory. More 
recently, Lord and Novick (1968) have stated a similar opinion. Within 
the framework of classical test theory, a number of indices of item dif- 
ficulty have been suggested which might possess this property. A normal 
curve transformation of P values to Z values, frequently referred to as 
Thurstone's method of absolute scaling, has been suggested by several authors 
(Bliss, 1929; Guilford, 1954; Horst, 1933; Thorndike, Bergman, Cobb, and
Woodyard, 1926; and Thurstone, 1925, 1947). A second method commonly
suggested for obtaining invariant item difficulty parameters, the limen
method, has been described by Bliss (1929), Thorndike et al. (1926), and Tucker
(1952, see Angoff, 1960). Modifications of the limen method have been 
suggested by Gulliksen (1950) and Richardson (1936). Both the method of 
absolute scaling and the limen method require the assumption of a normal 
distribution for the ability under consideration. Although they were first 
described 50 years ago, neither method has been the subject of any system- 
atic research. 

In 1960, George Rasch introduced a model for the latent trait analysis 
of tests of intelligence or attainment; subsequent refinement of this model 
has continued (Rasch, 1960, 1961, 1966a, 1966b). Wright (1967) has pointed
out that use of the Rasch model makes possible sample-free item and test
calibration. Item and test parameters can be computed from any sample of
subjects since the estimation of the parameters is independent of the
distribution of ability in the calibrating sample. The purpose of this study
was to investigate these claims.

The Rasch model is a special case of the logistic model: a simplified
case in which the parameter for item discrimination is removed. The Rasch
model makes the following assumptions:

1. Items are scored dichotomously,

2. Speed does not influence the probability of a correct
response,

3. Given the parameters for item easiness (e) and subject
ability (a), all responses on a test are stochastically
independent, and

4. The probability of a correct response by individual i
to item j is a function of the ratio a_i/e_j.

(Anderson, Kearney, and Everett, 1968; Brooks, 1965; and Sitgreaves, 1963).
This last assumption excludes guessing and variations in item discrimination
as factors which affect the probability of a correct response. Panchapakesan
(1969) has shown, however, that the Rasch simple logistic model is robust in
this respect.
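The form of the model described above can be illustrated with a minimal sketch. It uses the familiar one-parameter logistic expression written in terms of log ability and log difficulty; that parameterization, and the function and variable names, are assumptions made for illustration and are not taken from the paper, which states the model in terms of a and e.

```python
import math

def rasch_probability(log_ability: float, log_difficulty: float) -> float:
    """Probability of a correct response under the one-parameter logistic form.

    Assumption of this sketch: ability and difficulty are expressed on a
    common logarithmic scale, and only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(log_ability - log_difficulty)))

# An examinee whose ability equals the item difficulty succeeds half the time.
p_equal = rasch_probability(0.0, 0.0)

# Raising ability by one unit on the log scale multiplies the odds of success by e.
p_higher = rasch_probability(1.0, 0.0)
odds_higher = p_higher / (1.0 - p_higher)
```

Because no discrimination or guessing parameter appears, every item has the same curve shape and items differ only in location, which is the simplification the text refers to.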

Although introduced in 1960, the Rasch simple logistic model has not
been widely investigated. Two research designs have been employed in the
study of item calibration by the Rasch model. In the single-sample design
the goodness-of-fit of the item characteristic curve to the simple logistic
model constitutes a test of the invariance of the item easiness estimates.
(As Bock and Wood pointed out in 1971, only comparisons, that is, contrasts
or ratios between items, are meaningful because the sample-free rationale
employs an arbitrary origin and unit of scale. Only the relative difficulty
of items can be expressed.) Generalizations from single-sample studies are
limited to the range of abilities represented in the sample. In the two-sample
design, the item parameters are estimated independently on data obtained from
two samples of different ability. The two-sample design was employed in this
research because it constitutes a more stringent test of the Rasch model.

Item Calibration. To date the published literature contains reports of
only three investigations of item calibration using the Rasch model. Rasch
(1960) used data from four subtests of the Danish Military Group Intelligence
Test BPP which were given to 1094 Danish military recruits in September, 1953.
He found the data fit his model for subtests N (a test of finding the next
term in a numerical sequence) and L (a test similar to Raven's Progressive
Matrices, but with groups of letters instead of geometric figures). The model
was inadequate to explain performance on subtests F (in which geometric shapes
are to be decomposed into parts) and V (a test of verbal analogies). Rasch,
however, had used restrictive time limits with subtests F and V. When the
time factor was controlled the data for these subtests also fitted his model
(Rasch, 1966a).

Brooks' (1965) research was designed to determine whether data obtained
from American public school children with a group intelligence test would fit
the Rasch model. Samples of 509 eighth graders and 544 tenth graders in
Iowa Public Schools (all of whom had served as part of the standardization
sample for the 1964 Lorge-Thorndike Intelligence Test) were employed in this
study. The data for the eighth grade students were analyzed for all eight
subtests while the data for the tenth grade students were analyzed for only
three subtests: verbal 3 (written arithmetic problems), verbal 5 (word
analogies), and non-verbal 3 (geometric form analogies). In all, 178 items
were tested at the eighth grade level and 65 items were tested at the tenth
grade level; 177 (72.8%) of the 243 items tested fit the Rasch model,
supporting the hypotheses that the Rasch model is appropriate for representing
performance on a standardized, multiple choice test of intellectual ability,
and that Rasch item easiness estimates are invariant with respect to the
ability of the calibrating sample.

Brooks (1965) also investigated the invariance of item easiness estimates 
derived independently from two samples of differing ability. He reports the 
results of this analysis in terms of an I index, obtained by taking the 
square root of the mean of the squares of the perpendicular distance of the 
item points from the line dictated by the model. Brooks concludes that the 
points generally tended to fall along a straight line with unit slope but 
that these comparisons are somewhat difficult to evaluate. 
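The I index described above can be sketched as a root-mean-square perpendicular distance from a line of unit slope. The source does not say how the intercept of "the line dictated by the model" is fixed, so this sketch assumes the unit-slope line passing through the centroid of the item points; the function name and that intercept choice are illustrative assumptions, not Brooks' documented procedure.

```python
import math

def i_index(estimates_a, estimates_b):
    """Root-mean-square perpendicular distance of item points from a unit-slope line.

    estimates_a, estimates_b: item easiness estimates for the same items,
    obtained independently from two samples.
    """
    n = len(estimates_a)
    # Assumed intercept: the unit-slope line through the centroid of the points.
    intercept = (sum(estimates_b) - sum(estimates_a)) / n
    squared_distances = [
        # Perpendicular distance from (a, b) to the line y = x + c is
        # |b - a - c| / sqrt(2), so its square is (b - a - c)^2 / 2.
        ((b - a - intercept) ** 2) / 2.0
        for a, b in zip(estimates_a, estimates_b)
    ]
    return math.sqrt(sum(squared_distances) / n)

# Points lying exactly on a unit-slope line give an index of zero.
on_line = i_index([0.0, 1.0, 2.0], [0.5, 1.5, 2.5])

# Two points displaced symmetrically by one unit from the line y = x.
off_line = i_index([0.0, 0.0], [1.0, -1.0])
```

An index near zero means the two sets of estimates differ only by a shift of origin, which is all the model permits.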

Among the hypotheses investigated by Anderson et al. (1968) were the 
following: 

1. Rasch item easiness estimates are independent of the 
ability of the calibrating sample, and 

2. Rasch item easiness estimates are more stable when 
items which fit the Rasch model are considered. 

The test used in this research was the 45-item spiral omnibus intelligence
test, used for screening applicants who apply to join the Australian Army or
Royal Australian Navy. One sample consisted of 608 recruit applicants to
the Citizen Military Force (CMF), a part-time system of military training.
The second sample consisted of 874 recruit applicants to the Royal Australian
Navy (RAN). This latter sample was actually composed of three types of
examinees: 446 general service recruits, 129 reservists (the RAN equivalent
of the CMF), and 279 recruits to the women's section of RAN. Twelve items
were deleted for zero or 100% correct responses and the ability dimension was
categorized into six levels which corresponded to cutoff points used by the
military.
The hypothesis that Rasch item easiness estimates are independent of
the ability of the calibrating sample was first investigated using a
single-sample design. For the CMF sample 30 (91%) of the items fit the Rasch
model at the .01 level of confidence; 25 (76%) of the items fit the Rasch model
at the more stringent .05 level of confidence. (The level of confidence
represents the probability of obtaining the observed pattern of responses,
assuming the Rasch model is adequate to explain performance on the item. A
.01 level of confidence indicates that the observed pattern of responses
would occur only one time in 100 for items which fit the Rasch model. Thus,
the reverse of the normal situation occurs, with the .05 level of confidence
representing a more stringent criterion than the .01 level of confidence.)
For the RAN sample the corresponding values were 22 (67%) and 16 (48%).
The authors concluded that these results support the hypothesis for the
range of abilities represented by the samples.

Anderson et al. (1968) also employed a two-sample design in investigating
this hypothesis. This was accomplished by computing the product-moment
correlation between the item easiness estimates obtained from the CMF
and RAN samples. The authors concluded from the correlation of .958 that
the item easiness estimates were independent of the ability of the samples
upon which they were computed. This correlation was based on all 33 items.
Only those items satisfying the Rasch model, however, can be expected to
possess the properties attributed to the model. Accordingly, when those items
that failed to fit the Rasch model at the .05 level were deleted, a correlation
of .990 was obtained between the remaining item easiness estimates. This
compares favorably with the correlation of .958 obtained when comparing all
items.

Test Calibration. Only two investigations have been published regarding
the use of the Rasch model to achieve sample-free test calibration. When the
Rasch model is used to calibrate a test, logarithmic ability estimates are
assigned to every possible raw score from 1 to K-1. These scores indicate
the amount of ability required to achieve that score. A comparison of the
logarithmic ability estimates assigned to a test by two samples of different
ability should indicate the degree to which the corresponding raw score
groups are assigned the same ability estimate by the two samples. Wright
(1967) reports one investigation based on the responses of 976 beginning law
students to 48 reading comprehension items on the Law School Admission Test.
To obtain samples of different ability, Wright selected two comparison groups
from his total sample. The "dumb group" included the 325 students who did
poorest on the test. The top score in this group was 23. The "smart group"
included the 303 students with the highest scores. The lowest score in this
group was 33, leaving a ten point difference between the smartest person in
the "dumb group" and the dumbest person in the "smart group". The test was
calibrated separately on the two groups and the results were presented
graphically. Wright compared the similarity between the two sets of
logarithmic ability estimates and two sets of percentile ranks and concluded
that the Rasch model does lead to sample-free test calibration while the
"traditional" method does not.
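The assignment of a logarithmic ability estimate to each raw score from 1 to K-1, described above, can be sketched as follows. The sketch assumes the item difficulties are already known on a log scale and solves the standard maximum-likelihood score equation (expected score equals raw score) by bisection. It is an illustration under those assumptions, not a reproduction of the calibration programs cited in this paper; all names are hypothetical.

```python
import math

def expected_score(log_ability, log_difficulties):
    """Sum of success probabilities over all items at a given log ability."""
    return sum(1.0 / (1.0 + math.exp(-(log_ability - d))) for d in log_difficulties)

def ability_for_raw_scores(log_difficulties, lo=-20.0, hi=20.0, iterations=100):
    """Return {raw score: log ability estimate} for raw scores 1 to K-1.

    Scores of 0 and K are excluded because no finite ability reproduces them.
    The search bounds lo and hi are assumed wide enough for ordinary tests.
    """
    k = len(log_difficulties)
    estimates = {}
    for raw in range(1, k):
        low, high = lo, hi
        for _ in range(iterations):
            mid = (low + high) / 2.0
            # The expected score rises monotonically with ability,
            # so bisection always closes in on the single solution.
            if expected_score(mid, log_difficulties) < raw:
                low = mid
            else:
                high = mid
        estimates[raw] = (low + high) / 2.0
    return estimates

# Hypothetical four-item test in which every item has the same difficulty.
table = ability_for_raw_scores([0.0, 0.0, 0.0, 0.0])
```

Each raw score maps to exactly one ability estimate, so two calibrations of the same test can be compared score by score.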

Anderson et al. (1968) also addressed themselves to this question. They 
correlated the ability estimates assigned to the six ability groupings on the 
basis of the CMF sample with those obtained from the RAN sample. The resulting 
product-moment correlation of .992 was interpreted as evidence that the ability
estimate assigned to a score on a test is independent of the distribution of 
ability in the calibrating sample. 

In summary, few studies have been published on the use of the Rasch
model in item and test calibration. The invariance of Rasch item easiness
ratios with respect to the ability of the calibrating sample has been studied
by Anderson et al. (1968), Brooks (1965), and Rasch (1960). The use of the
Rasch model to achieve sample-free test calibration has been studied by
Wright (1967) and Anderson et al. (1968). It is apparent that more studies
of sample-free item and test calibration with the Rasch model remain to be
performed before the model's usefulness can be fully assessed.

This paper examines the application of the Rasch model to analogy items.
The following hypotheses were investigated: 

1. Rasch item easiness estimates are invariant with respect
to the ability level of the calibrating sample.

2. The higher the probabilities that the individual items
fit the Rasch model, the more invariant the item easiness
estimates are with respect to the ability level of the
calibrating sample.

3. Rasch ability estimates, assigned in the calibration of
a test, are invariant with respect to the ability level
of the calibrating sample.

Hypotheses 1 and 2 are tests of the invariance of the Rasch item easiness
estimates; hypothesis 3 is a test of the invariance of the ability estimates
assigned to a test. To provide a base line against which the invariance of
the Rasch item easiness estimates can be compared, a conventional item
easiness parameter, the Z item difficulty index, was also calculated and
submitted to similar tests.

METHOD 

Selection of Item Format. Spearman's "g" or general mental ability is
a complex, somewhat poorly defined construct which seems to be represented
in almost all the major intelligence tests in use today. Helmstadter (1964)
points out that tests dealing with abstract relationships (such as verbal,
numerical, or symbolic analogies) come closest to representing what is meant
by "g". For this reason, the analogy format was selected for study in this
research. Guilford (1959) suggests that there are several meaningfully
different methods of asking analogy questions. In his Structure of Intellect
the analogy format tests the ability to "recognize relationships". This
general ability can be factored into abilities at recognizing figurally,
symbolically, semantically, and behaviorally presented relationships,
depending upon the type of material used to present the question. To make
the results as general as possible, it was decided to study figural (picture),
symbolic (number and symbol), and semantic (word) test items. Two types of
symbolic material were used because of the intrinsic differences in the two,
and because Guilford (1966) reports several instances in which cells in his
Structure of Intellect contain more than one factor.

Subjects. Data were obtained for four samples of subjects. College
students enrolled in an introductory psychology class at the University of
Minnesota completed 1404 test booklets. Each student was a volunteer who
participated in the experiment to earn additional points towards his course
grade. The students were given the option of completing 1, 2, or 3 test
booklets, hence the exact number who participated in the experiment is not
known. High school students enrolled in two suburban Twin Cities high
schools completed 484 test booklets. Each student completed one test booklet.
In both schools the test booklets were completed by students in the classes
of those teachers who volunteered to participate in the study. Civil service
clerical employees of the City of Minneapolis completed 289 test booklets as
part of a battery of tests. Finally, 90 clients of the Minnesota State
Division of Vocational Rehabilitation (DVR) completed a short word analogy
test as part of a vocational assessment test battery.

The samples, for the most part, were similar in race, religion, and 
sex composition. The high school and college students were younger than the 
clients and civil service employees, had fewer marital obligations,
were better educated, and came from homes with higher family incomes, better 
educated mothers, and fathers employed in higher level occupations. In 
comparison with the high school and college students, the civil service 
employees were older, had lower family incomes, and were far more likely to 
be married and have children. The DVR clients, while heterogeneous in many 
respects, were less well educated and had lower family incomes than the high 
school and college students. 

Instruments. The four basic tests designed for use in this study were
a 60-item word analogy test, a 60-item number analogy test, a 50-item picture 
analogy test, and a 40-item symbol analogy test. (For a discussion of the 
test construction process, see Tinsley, 1971.) None of the tests had formal
time limits, although time limits were imposed by the settings in which the
tests were administered. Because of time limitations inherent in the college
and high school settings, it was desirable to have tests which would require
an average of 50 to 60 minutes to complete. For this reason, the four tests
were combined into two test booklets. Form WS-100 contained the 60-item
word analogy test and the 40-item symbol analogy test; form NP-110 contained 
the 60-item number analogy test and the 50-item picture analogy test. A 
fifth test designed for use with the DVR clients, form W-25, contained 25 
word analogies. This short test was administered alone in order that the 
testing time for DVR clients could be kept to an absolute minimum. 

Results on two additional tests are reported herein even though the
data were collected for use in another study. The items of interest,
30 picture and 30 word analogies, were presented in two different test
booklets. Form WP-60, containing these 60 items, was administered to
Minneapolis civil service employees. Form MNWP-110, containing these items
plus 50 number analogies, was administered to college students. These word
and picture analogies had been selected in an unusual manner. The picture
items had been selected from the picture items surviving an iterative item
analysis procedure (for details, see Tinsley, 1971). The word analogies
were then constructed from the picture analogies by substituting, in the
place of the picture, the word for the object in the picture. The resulting
30 word analogies have undergone no formal item analysis. None of these
word analogies appears on form WS-100.

Each analogy item presented five alternative answers, only one of which 
was correct. Because the test booklets used in this research had been de- 
signed to be self-explanatory, examinees were simply given the test booklet 
and answer sheet and were instructed to read the directions and complete the 
test. An examiner was always available, however, to answer any questions. 

The college students were the only group to complete more than one test 
booklet. For approximately half the college students the order of
administration was WS-100, NP-110, and MNWP-110. For the other half the order of
administration was NP-110, MNWP-110, and WS-100. 

Analysis. Before formal analysis of the data was begun, the data were
edited to eliminate presumably careless or slow examinees. This was
accomplished by eliminating from the study any examinee who left several
consecutive items blank, who left blank the last few items in a test, or who
left blank more than five items in the entire test booklet. For forms WP-60
(administered to Minneapolis civil service employees), MNWP-110 (administered
to college students), and W-25 (administered to DVR clients) no blank
responses were tolerated because the forms were so short. For college
students, 5 NP-110 and 1 MNWP-110 test booklets were eliminated. For high
school students, 3 word tests, 14 symbol tests, 17 number tests, and 42
picture tests were not used. The higher percentage of high school students
who failed to complete their test booklets was due to the limited time
available for testing. The students were allowed only one 50-minute class
period to complete the test booklet. Only 1 DVR client and 20 civil service
employees failed to complete their tests.

The scored item responses were then submitted to analysis. Calculation 
was performed using a computer program written by Wright and Panchapakesan 
(1969, 1970) and modified by Bart, Lele, and Rosse (1970) for use on the 
University of Minnesota's Control Data 6600 computer. 

The first question of interest was whether the use of the Rasch model
leads to item easiness estimates that are invariant with respect to the
ability of the calibrating sample. Ten tests were attempted in this study
(see Table 1). In each case a set of analogy items was completed by two
samples of different ability, the two sets of data were independently
submitted to item analysis, and the product-moment correlation was calculated
between the two sets of Rasch item easiness estimates and, for comparison
purposes, between the two sets of Z item difficulty estimates. For the data
to support the conclusion that item parameters are invariant with respect
to the ability of the calibrating sample, the correlation between the two
appropriate sets of data must approach unity. This determination was made
by inspection of the pattern of observed correlations.
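The two-sample comparison described above can be sketched in a few lines. The sketch converts proportions correct (P values) to Z values with the inverse normal distribution and correlates the two sets. The sign convention for Z (here higher Z means an easier item), the function names, and the proportions are all hypothetical choices for illustration; the sign convention does not affect the correlation.

```python
import math
from statistics import NormalDist

def z_difficulty(p_value):
    """Normal-curve transformation of a proportion correct (0 < P < 1)."""
    return NormalDist().inv_cdf(p_value)

def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical proportions correct for the same five items in two samples
# of different ability (not data from this study).
p_sample_one = [0.90, 0.75, 0.60, 0.45, 0.30]
p_sample_two = [0.80, 0.60, 0.45, 0.30, 0.15]

z_one = [z_difficulty(p) for p in p_sample_one]
z_two = [z_difficulty(p) for p in p_sample_two]
r_z = pearson(z_one, z_two)
```

The same `pearson` function would be applied to the two sets of Rasch item easiness estimates; a value approaching unity is the criterion of invariance used in the text.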

Insert Table 1 about here 

The relationship between the "goodness-of-fit" of the item and its
invariance was also studied. First, the Rasch item easiness estimates
derived from two groups were correlated across all items. Then those items
which failed to fit the Rasch model for both groups at the .01 level of
confidence were removed and the correlation was recomputed. This procedure
was also followed using the .05, .10, .25, .30, .35, and .40 levels of
confidence. A similar procedure was employed in investigating the relationship
between the invariance of the Z item difficulty estimate and the
"goodness-of-fit" of the P value. The criteria used in this instance were
.20 ≤ P ≤ .80, .30 ≤ P ≤ .70, and .40 ≤ P ≤ .60. In both cases, the
hypothesis was that the product-moment correlation between item parameters
would increase as the criterion became more stringent.
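The deletion procedure above can be sketched as a loop over criterion levels. The sketch takes the text literally and drops an item only when its fit probability falls below the level in both samples; that reading, the function names, and the minimum of three items required for a correlation are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlations_by_fit_level(easiness_a, easiness_b, fit_prob_a, fit_prob_b, levels):
    """Recompute the correlation after deleting misfitting items at each level."""
    results = {}
    for level in levels:
        kept = [
            i for i in range(len(easiness_a))
            # Assumed reading: delete an item only if it misfits in both samples.
            if not (fit_prob_a[i] < level and fit_prob_b[i] < level)
        ]
        if len(kept) < 3:
            results[level] = None  # too few items left to correlate (assumed rule)
            continue
        results[level] = pearson(
            [easiness_a[i] for i in kept],
            [easiness_b[i] for i in kept],
        )
    return results

# Hypothetical estimates: the last item misfits badly in both samples.
ease_a = [0.0, 1.0, 2.0, 3.0, 4.0]
ease_b = [0.0, 1.0, 2.0, 3.0, -4.0]
fit_a = [0.50, 0.50, 0.50, 0.50, 0.001]
fit_b = [0.50, 0.50, 0.50, 0.50, 0.001]
by_level = correlations_by_fit_level(ease_a, ease_b, fit_a, fit_b, [0.0, 0.01])
```

A level of 0.0 keeps every item, so the first entry serves as the all-items baseline against which the later entries are compared.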

Finally, the invariance of the ability estimates computed for each raw
score was investigated by computing the product-moment correlation between
two sets of independently obtained ability estimates.

RESULTS 

Item Calibration. Ten sets of data were collected which were relevant
to an investigation of the invariance of Rasch item easiness and Z item
difficulty estimates (see Table 1). In each case, independent estimates of
the easiness of the items in the test, obtained from two samples of different
ability, were correlated. Tables 2 and 3 indicate the results of these
analyses.

Insert Tables 2 and 3 about here 



In all but one comparison the correlation between independent estimates
of Rasch item easiness differs by no more than one point from the correlation
between independent estimates of Z item difficulty. Four tests of the
invariance of the item parameter estimates were conducted with word analogies.
The Rasch item easiness estimates obtained from college students on a 60-item
word analogy test correlated .95 with those obtained from high school
students (comparison I), while the item easiness estimates obtained from
college students on a 30-item word analogy test correlated .91 with those
obtained from civil service employees (comparison IV). At the other extreme,
the Rasch item easiness estimates obtained from college students and high
school students had zero correlations with those obtained from DVR clients
(comparisons II & III). Four tests of the invariance of the item parameter
estimates also were conducted with picture analogies. The Rasch item
easiness estimates obtained from college students on a 50-item test
correlated .97 with those obtained from high school students (comparison V),
while the item easiness estimates obtained from college students on a
30-item picture analogy test correlated .88 with those obtained from civil
service employees (comparison VIII). The Rasch item easiness estimates
obtained from college and high school students on 25 items embedded in the
50-item picture analogy test correlated .29 and .32, respectively, with the
item easiness estimates obtained from civil service employees on those
25 items embedded in the 30-item picture analogy test (comparisons VI & VII).
A single comparison (X) of item parameter estimates obtained from college
and high school students on a 40-item symbol analogy test yielded a
correlation of .98 between the Rasch item easiness estimates. And, finally, a
comparison (IX) of item parameter estimates obtained from college and high
school students on a 60-item number analogy test resulted in a correlation
of .93 between the Rasch item easiness estimates and a correlation of .97
between the Z item difficulty estimates.

The above results indicate the degree to which the item parameter
estimates are invariant when the analysis is performed on all items in the
test. The Rasch model, however, cannot be expected to hold for items which
do not fit the model. For this reason, the relationship between the
invariance of the item parameter estimates and the "goodness" of the item was
investigated. This relationship is relatively simple for the Z item
difficulty estimates. In general, the less restrictive the range of
acceptable item difficulties, the higher the correlation. In the six Z item
difficulty comparisons in which correlations of .89 or higher were obtained
(comparisons I, IV, V, VIII, IX, & X), the highest correlation is observed
when all items are included in the comparison and the correlation drops with
each restriction of the range of acceptable item difficulty. In the four
remaining comparisons (II, III, VI, & VII), the correlations fluctuate
randomly with each restriction of the range of acceptable item difficulty.

Elimination of items which did not fit the Rasch model resulted in 
increases in the correlation between Rasch item easiness estimates. However, 
the results did not follow a single pattern. Only the comparison of the 
Rasch item easiness estimates obtained from college students and civil 
service clerical employees on 30 picture analogies (comparison VIII) showed 
a steady decrease in correlation as items with lower Rasch probabilities 
were removed. Item easiness estimates obtained from high school students 
and civil service employees on 25 picture analogies (comparison VII) showed 
an initial increase in correlation when those items with Rasch probabilities 
below .01 were removed. The correlation fell to zero, however, when those 
items with Rasch probabilities below .05 were removed, and fluctuated randomly 
with subsequent deletions of items. Item easiness estimates obtained from 
college and high school students on 60 number analogies (comparison IX) 
increased in correlation when items with Rasch probabilities below .01 were 
deleted, and remained stable until after deletion of items with Rasch 
probabilities below .25. At that point, the correlation began an uninterrupted 
drop. 

The remainder of the comparisons showed some increase in correlation as
items with low Rasch probabilities were deleted. In the comparison of item
easiness estimates obtained from college students and civil service employees
on 30 word analogies (comparison IV) the increase was somewhat erratic, and
in the comparison of item easiness estimates obtained from college students
and DVR clients on 25 word analogies (comparison II) negative correlations
were obtained. But this latter comparison and the comparisons of college
and high school students on 60 word analogies (comparison I), on 50 picture
analogies (comparison V), and on 40 symbol analogies (comparison X) all
correlated .99 when items with low Rasch probabilities were removed.

Test Calibration. It is very rare for educational or psychological
measurement to be made with only one item. In practice, tests of ability
contain several items and the overall performance of the examinee is the
basis from which generalizations about ability are made. The Rasch model
takes account of the easiness of the items in a test in estimating the
amount of ability indicated by raw scores on that test. It is appropriate,
therefore, to ask whether the ability estimates assigned to test scores are
invariant with respect to the ability of the calibrating sample. In each
of the ten cases investigated (see Table 2), the product-moment correlation
between the Rasch ability estimates was .999. Figure 1 illustrates the
relationship between the ability estimates calculated for a 25-item word
analogy test from the responses of 630 college students and 89 DVR clients
(comparison II).



Insert Figure 1 about here 



DISCUSSION 

Item Calibration . Ten tests of the invariance of Rasch item easiness 
estimates and Z item difficulty estimates were made with mixed results. 

The results are not so equivocal as they appear, however. Anderson et al. 
(1968) point out that the Rasch model does not lend itself to small samples 
Generally., samples of 500 or larger are needed to obtain stable item 
easiness (and ability) estimates. It is important, therefore, to keep the 
size of the sample in mind in interpreting the results. The comparison of 
item easiness estimates obtained from 630 college students with those 
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obtained from over 300 high school students (comparisons I & X, on 60 word 
analogies and 40 symbol analogies) yielded correlations of .95 and .98. 
Correlations of .37 and .93 were observed when the item easiness estimates 
obtained from 492 college students were compared with those obtained from 
120 high school students (comparison V on 50 picture analogies) and from 
145 high school students (comparison IX on 60 number analogies). And the 
comparison of item easiness estimates obtained from college students and 
from 269 civil service employees on 30 word and on 30 picture analogies 
(comparisons IV & VIII) yielded correlations of .91 and .88. In contrast, 
the two comparisons involving item easiness estimates obtained from 89 DVR 
clients (comparisons II & III) resulted in zero correlations. It appears, 
therefore, that six of the comparisons of item easiness estimates made in this 
research yielded invariant item easiness estimates, especially considering 
the small sample sizes employed. Two of the four comparisons which did not 
support the hypothesis of invariant item easiness estimates are invalid 
because of the extremely small sample size. 

Two comparisons (VI & VII) remain, however, which did not support the 
hypothesis. Both were based on small samples but the samples were larger 
than samples used in some comparisons which did support the hypothesis. It 
is possible that the nature of the test was a factor in these results. Both 
comparisons involved the item easiness estimates obtained from civil service 
employees for 25 of the 30 picture analogies on form WP-60. (Form WP-60 
consisted of 30 analogies expressed in word form followed by the same 30 
analogies expressed in picture form.) It seems likely, therefore, that the 
estimates obtained from the civil service employees were contaminated by 
some factor other than ability and item difficulty. This factor might have 
been the recognition of some of the picture analogies as identical to the 
preceding word analogies. 

Another factor which may have served to reduce the invariance of the 
item easiness estimates must be mentioned briefly. Panchapakesan (1969) 
provides a criterion for the elimination of examinees with low scores so 
that the estimation of item easiness will not be contaminated by guessing. 
According to her criterion, some of the subjects in this study should have 
been eliminated. Because of the initially small sample size, this procedure 
was not followed. It is possible, therefore, that guessing may have reduced 
the invariance of the item easiness estimates in some instances. 

In summary, six of the ten comparisons supported the hypothesis that 
the Rasch item easiness estimates were invariant with respect to the ability 
of the calibrating sample, even though a number of the comparisons involved 
samples of questionable size. Of the four remaining comparisons, two 
included samples so small as to invalidate the results while the other two 
were invalid because the Rasch model was not appropriate for tests designed 
in that manner. 

It must be noted, however, that the results for the Z item difficulty 
estimates compare well with those for the Rasch item easiness estimates. 
There is no basis from these data for choosing between the two item 
parameters. Such a choice could be made on the basis of the assumptions involved 
in the two parameters. The Z item difficulty estimate requires the 
assumption that the sample is normally distributed while the Rasch item 
easiness estimate requires no assumption about the ability of the calibrating 
sample. It should be noted, parenthetically, that either the samples used 
in this study were normally distributed in terms of ability or the Z item 
difficulty estimates are robust to violations of the normality assumption. 
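
The role of the normality assumption can be made concrete with a minimal 
sketch. It assumes, since the definition is not restated in this section, 
that the Z item difficulty is the standard normal deviate corresponding to 
the proportion of the sample answering the item correctly. The function 
name and the sign convention (larger values for harder items) are 
illustrative choices, not the authors' procedure. 

```python
from statistics import NormalDist


def z_item_difficulty(proportion_correct):
    """Normal-deviate difficulty index for one item (assumed definition).

    Converts the proportion of the sample answering correctly into the
    point on a standard normal ability scale above which that proportion
    of examinees falls. Harder items receive larger values.
    """
    if not 0.0 < proportion_correct < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - proportion_correct)


# Hypothetical proportions correct for three items, easiest first.
proportions = [0.90, 0.50, 0.20]
difficulties = [z_item_difficulty(p) for p in proportions]
```

The conversion is meaningful only to the extent that ability is normally 
distributed in the sample, which is the assumption referred to above. 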

The above results represent a stringent test of the Rasch model in 
that items for which the Rasch model is clearly inappropriate were included 
in the comparison. Deletion of these items should result in an increase in 
the correlation of the item easiness estimates obtained from different 
samples. This result was observed for five of the six valid comparisons. 
In three of these comparisons (I, V, & X) the correlation increased to .99. 
In the other two cases (comparisons IV & IX) the correlation increased at 
first and then decreased. In both such instances, the number of items 
remaining had grown so small that the lowering of the correlation may have 
resulted from a restriction of the range of item easiness estimates. Only 
the results obtained when comparing the item easiness estimates obtained 
from 269 civil service employees and from 276 college students (comparison 
VIII) for 30 picture analogies failed to support this hypothesis. Both 
samples completed these picture items after completion of 30 word analogies 
having identical relationships. Therefore, the resulting item easiness 
estimates may have been contaminated. 

Test Calibration. It was hypothesized that Rasch ability estimates 
are invariant with respect to the ability of the calibrating sample. The 
results of each of the ten comparisons support this hypothesis. Even in 
those instances in which the samples were so small that the individual 
item easiness estimates were sample dependent, the resulting ability 
estimates were invariant. This is important because test items are almost 
always administered in groups. These results indicate that the ability 
estimates assigned to any collection of 25 or more items will be invariant 
with respect to the ability of the calibrating sample, regardless of 
whether the separate item easiness estimates were invariant or not. 
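
The following simulation is a minimal sketch of why ability estimates can 
remain stable when individual item estimates do not. It is not a reanalysis 
of the data in this study. It assumes a log-metric Rasch model, assigns to 
each raw score the ability at which the expected score equals that raw 
score, and perturbs a hypothetical set of 25 item easiness values with 
random error. All names, the error size, and the seed are illustrative 
assumptions. 

```python
import math
import random


def expected_score(log_ability, log_easiness):
    """Expected raw score under the Rasch model at one ability level."""
    return sum(1.0 / (1.0 + math.exp(-(log_ability + e))) for e in log_easiness)


def ability_for_raw_score(raw_score, log_easiness, lo=-20.0, hi=20.0, tol=1e-9):
    """Log ability whose expected score equals raw_score, found by bisection."""
    if not 0 < raw_score < len(log_easiness):
        raise ValueError("zero and perfect scores have no finite estimate")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, log_easiness) < raw_score:
            lo = mid  # expected score too low, so ability must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def compare_calibrations(n_items=25, noise_sd=0.5, seed=0):
    """Return (item correlation, ability correlation) for two calibrations."""
    rng = random.Random(seed)
    # Hypothetical item easiness values and a noisy second calibration.
    easiness_a = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    easiness_b = [e + rng.gauss(0.0, noise_sd) for e in easiness_a]
    scores = range(1, n_items)  # raw scores 1 .. n_items - 1
    ability_a = [ability_for_raw_score(r, easiness_a) for r in scores]
    ability_b = [ability_for_raw_score(r, easiness_b) for r in scores]
    return pearson(easiness_a, easiness_b), pearson(ability_a, ability_b)


item_r, ability_r = compare_calibrations()
```

In this sketch the ability estimate for a raw score depends on the item 
easiness values only through their sum of response probabilities, so errors 
in individual items tend to offset one another across the test. 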

The implications of this finding and of the earlier finding of the 
invariance of the item easiness estimates, given a sufficiently large 
sample, should not be ignored. The estimation of the amount of ability 
indicated by a raw score on a test is based upon the aggregate difficulty 
of the items in that test. The preceding results indicate that the 
calculation of the difficulty of the items and the subsequent calibration of 
the test in terms of the amount of ability represented by each raw score can 
be made from any sample. The researcher need not be concerned with the 
distribution or level of ability in the calibrating sample; the calibration 
of a test is independent of these factors. 

CONCLUSIONS 

The results of this research support the following conclusions: 

1. Rasch item easiness estimates are invariant with 
respect to the ability of the calibrating sample 
when an adequate sample is employed. 

2. Invariance of the Rasch item easiness estimates is 
related to the goodness-of-fit of the items to the 
Rasch model. The deletion of items with low Rasch 
probabilities increases the invariance of the Rasch 
item easiness estimates. 

3. The estimation of the amount of ability indicated 
by the raw scores on a test is invariant with 
respect to the ability of the calibrating sample 
for tests of 25 or more items even when relatively 
small samples are employed. 

Figure 1 

Invariance of Rasch Ability Estimates 